Add MPS support with fused kernels #76

andrewkchan · 2024-04-14T02:40:16Z

This PR adds support for GPU acceleration via the MPS backend on MacOS per #60.

I ported the custom gsplat PyTorch ops with fused kernels for gaussian projection, rasterization, etc. to metal performance shaders.
I tested the app compiled with MPS support on the banana example. It works!
I also tested using the unit tests in the gsplat repo (I will probably upstream these changes at some point). The backward pass jacobians for some ops are sometimes a little off the pure PyTorch jacobians but it doesn't seem to make a difference when testing with real examples.

Here's the speedup on my M3 Pro with MacOS Sonoma 14.3. Wall clock goes from 5 minutes to 5 seconds!

GPU

time ./opensplat ~/Downloads/banana/ -n 100
./opensplat ~/Downloads/banana/ -n 100  3.84s user 1.54s system 90% cpu 5.920 total

CPU

time ./opensplat ~/Downloads/banana/ -n 100 --cpu
./opensplat ~/Downloads/banana/ -n 100 --cpu  647.85s user 93.12s system 227% cpu 5:26.11 total

Some implementation notes:

I added some logs I found useful while developing indicating when resources are loaded and printing errors. I couldn't find any practices on different log levels (debug, info, warn, error, etc.) so right now everything is visible to the user.
Global resources like compute pipeline state objects and command queues are allocated lazily then leak forever for the lifetime of the program, I think this is ok because we're only allocating a very small fixed number of these and they'll be cleaned up when the application exits.

It was very useful to generate an XCode project from the CMakeLists.txt for debugging this for future reference, since XCode provides some nice GPU tools.

pfxuan · 2024-04-14T06:31:50Z

@andrewkchan Thank you for your fantastic work in making the MPS backend finally work! I did a quick test for this PR and found it works almost faultlessly from my laptop. My test setup is based on a customized M2 Max (30 of the 38 GPU cores).

CMake and XCode build:

cmake -G Xcode -DCMAKE_PREFIX_PATH=../../libtorch -DGPU_RUNTIME=MPS -DCMAKE_BUILD_TYPE=Release -DOPENSPLAT_BUILD_SIMPLE_TRAINER=ON  ../

./opensplat ~/Data/banana -n 100  3.76s user 1.75s system 100% cpu 5.465 total

CMake and make build:

cmake -DCMAKE_PREFIX_PATH=../../libtorch -DGPU_RUNTIME=MPS -DCMAKE_BUILD_TYPE=Release -DOPENSPLAT_BUILD_SIMPLE_TRAINER=ON  ../

./opensplat ~/Data/banana -n 100  4.05s user 1.78s system 101% cpu 5.741 total

Perhaps this PR is the first time to make it become possible to natively running 3DGS workload on apple silicon. I'm super excited for this new feature and expecting it would add a new impact in this area soon 🚀

josephldobson · 2024-04-14T07:42:23Z

Great work! very excited to start using this :)

pierotofy · 2024-04-14T12:36:08Z

Agree! Very exciting @andrewkchan, thanks for the PR. I can't wait to run this on my M1 🙏

I will test/review this sometimes by today or tomorrow.

Quick question, I noticed you made some changes in RasterizeGaussiansCPU and changed size_t pixIdx = (i * width + j), did you find some issues with the CPU implementation while working on this?

pierotofy · 2024-04-14T17:58:41Z

Currently getting this error while trying to run:

...
init_gsplat_metal_context: load function rasterize_backward_kernel with label: (null)
init_gsplat_metal_context: error: load pipeline error: Error Domain=AGXMetal13_3 Code=3 "Compiler encountered an internal error" UserInfo={NSLocalizedDescription=Compiler encountered an internal error}

Also noticed that the definition for rasterize_forward_kernel_cpso might be missing?

pfxuan · 2024-04-14T18:09:20Z

The error might be coupled with your macos version (13.3 vs 14.4). I was able to run through metal compile with the latest 14.4.1. Maybe we should consider tweaking cmake like ggerganov/llama.cpp#6370 did.

Also, it seems like there is a memory leaking problem on metal and I'm trying to resolve it now.

pierotofy · 2024-04-14T18:24:19Z

Yep, I'm on 13.2. I'll try a few things, see if I can get it to compile.

Also, saw that we use nd_rasterize_forward_kernel_cpso for the rasterize forward pass, so that explains why rasterize_forward_kernel_cpso is not included. Makes sense.

andrewkchan · 2024-04-14T19:19:06Z

Quick question, I noticed you made some changes in RasterizeGaussiansCPU and changed size_t pixIdx = (i * width + j), did you find some issues with the CPU implementation while working on this?

Hmm, I don't remember changing this code. Looks like it was from the experimental commit that I based my changes on 472a45a

Also, saw that we use nd_rasterize_forward_kernel_cpso for the rasterize forward pass, so that explains why rasterize_forward_kernel_cpso is not included. Makes sense.

Yeah, I had ported over the ND rasterize function instead of the rasterize by accident but then decided to just use that. It's possible this was causing the slight numerical differences in unit tests. Happy to port over the rasterize_forward_kernel if needed.

Also, it seems like there is a memory leaking problem on metal and I'm trying to resolve it now.

Curious what problem you are running into! Since as noted in the OP I'm intentionally leaking some resources forever.

pfxuan · 2024-04-14T19:29:35Z

The memory consumption keeps growing when increasing the training iter num. e.g. 2000 iters will accumulate 12+ GB GPU memory.

I added emptyCache for MPS device. But it seems like there is no visible improvement.

#elif defined(USE_MPS)
#include <torch/mps.h>

        if (device != torch::kCPU){
            #ifdef USE_HIP
                    c10::hip::HIPCachingAllocator::emptyCache();
            #elif defined(USE_CUDA)
                    c10::cuda::CUDACachingAllocator::emptyCache();
            #elif defined(USE_MPS)
                    at::detail::getMPSHooks().emptyCache();
            #endif
        }

Also, adding or enforcing autoreleasepool across all methods in gsplat_metal.mm doesn't have any help so far. The root cause is probably linked with the deep copy of mps tensor:

pierotofy · 2024-04-14T19:42:48Z

Happy to port over the rasterize_forward_kernel if needed.

No need, but could certainly be done as part of another PR.

I'm still trying to compile this on 13.2; I've isolated the problem to a call to simd_shuffle_and_fill_down, but it's strange since my machine should have Metal 3 and that function has been available since 2.4.. investigating.

andrewkchan · 2024-04-14T20:38:10Z

What is the expected memory usage for something like the banana example? It's not great if there is a leak. But I'm not able to find anything using the XCode "Leaks" tool except for two allocations of 128 byte objects. And I thought that memory usage is generally expected to increase over training because the number of gaussians will increase with scheduled splits.

pfxuan · 2024-04-14T20:40:03Z

We can use this table as a baseline #3 (comment)

pfxuan · 2024-04-14T21:20:08Z

For 2000 iters, the combined memory consumption is around 5.5 GB (CPU: 4.1 GB, GPU: 1.4 GB) from cuda version:

andrewkchan · 2024-04-14T21:31:02Z

Here's what I'm seeing around 1900 iters on the banana (8.7GB):

I wasn't aware of the at::detail::getMPSHooks() you suggested. Curious where you found that! Interestingly, I tried printing at::detail::getMPSHooks().getCurrentAllocatedMemory() and at::detail::getMPSHooks().getDriverAllocatedMemory() and the numbers are way off. I wonder if that interface is not managing what we want.

pierotofy · 2024-04-14T21:46:19Z

I ended up upgrading to 14.4 and it now runs 🥳

I think there might be something off with the rasterize forward pass however, this is the result of the metal renderer after 100 iters on the banana dataset (you can do even just 10 iters):

./opensplat -n 100 --val-render render ./banana
Step 100: 0.326634

Compared to the CPU run:

./opensplat -n 100 --val-render render ./banana --cpu
Step 100: 0.188744

Looks like a width/height mismatch. I had these issues when writing the CPU rasterizer, I recommend using the simple_trainer app with different --width and --height parameters to test (by default they are equal), e.g. ./simple_trainer --width 256 --height 128 --render render

andrewkchan · 2024-04-14T21:53:00Z

Nice catch! You are totally right. Fixed and the loss is much lower after 100 iters now - 0.15966 vs 0.286877 for me.

pfxuan · 2024-04-14T23:36:08Z

Here's what I'm seeing around 1900 iters on the banana (8.7GB):

I wasn't aware of the at::detail::getMPSHooks() you suggested. Curious where you found that! Interestingly, I tried printing at::detail::getMPSHooks().getCurrentAllocatedMemory() and at::detail::getMPSHooks().getDriverAllocatedMemory() and the numbers are way off. I wonder if that interface is not managing what we want.

I'm thinking the size of the combined memory we got from mps backend is likely within the correct range. With emptyCache(), I was able to achieve a slightly lower memory footprint (7.8GB vs 8.7GB). The latest Pytorch doesn't provide a corresponding C++ API (such as c10::mps::MPSCachingAllocator::emptyCache()) to explicitly release the MPS cache. Current workaround (at::detail::getMPSHooks().emptyCache()) was discovered from this python API and I'm still uncertain if it's truly effective.

pfxuan · 2024-04-14T23:53:58Z

Never mind, the memory footprint bumped to 9.1 GB (2k) after changing the the iter num (3k to 30K)

./opensplat ~/Data/banana -n 30000  228.13s user 12.47s system 81% cpu 4:53.68 total

pierotofy · 2024-04-15T00:48:52Z

Nice! This is looking pretty amazing and can be merged. Thanks for everyone's help. 👍

Memory improvements, as well as a possible port of the rasterize_forward_kernel_cpso function can be done as part of a separate contribution. About the latter, I noticed that the CPU rasterizer tends to converge a bit faster on the banana dataset (with downscale: 2, loss: ~0.15 after 100 iters vs 0.17), not sure if it's just random, but might be worth to investigate.

pfxuan · 2024-04-15T01:07:22Z

Great! I'll enable the MPS backend CI build via another PR. From my latest test, the combined memory footprint reaches 30.6 GB at the step 8800:

./opensplat ~/Data/banana -n 30000  2923.88s user 104.37s system 73% cpu 1:08:21.08 total

pfxuan and others added 14 commits April 13, 2024 19:05

Add initial test for MPS backend

472a45a

convert to CPU before using CPU rasterizer

b880671

scaffolding for metal kernels

52cfc10

wip

cfcbf19

..

574b662

macros

90f1e02

..

8055f11

replace macro with helper fn

38cfc99

use simd groups

8125246

fixes

1f541ee

fixes and cleanup

06d38c7

remaining functions

6ce8073

cleanup

f258ee4

readme

b899138

andrewkchan mentioned this pull request Apr 14, 2024

Mac/Windows Support #3

Closed

Merge branch 'main' into achan/mps-backend

1c11b4d

andrewkchan added 2 commits April 14, 2024 12:21

revert changes to gsplat_cpu.cpp

fe78f58

missing delete

e97adb8

fix w/h swap

b6c877e

ditto for nd_rasterize

043c9b2

Revert RasterizeGaussiansCPU changes

8c376c8

pierotofy merged commit 373b337 into pierotofy:main Apr 15, 2024
29 checks passed

This was referenced Apr 15, 2024

Use metal accelerated version of PyTorch #68

Closed

Add MSL support #47

Closed

andrewkchan mentioned this pull request May 28, 2024

MPS Support? nerfstudio-project/gsplat#163

Open

pfxuan mentioned this pull request Jun 30, 2024

Build Failed #112

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add MPS support with fused kernels #76

Add MPS support with fused kernels #76

andrewkchan commented Apr 14, 2024

pfxuan commented Apr 14, 2024

josephldobson commented Apr 14, 2024

pierotofy commented Apr 14, 2024 •

edited

Loading

pierotofy commented Apr 14, 2024

pfxuan commented Apr 14, 2024

pierotofy commented Apr 14, 2024

andrewkchan commented Apr 14, 2024 •

edited

Loading

pfxuan commented Apr 14, 2024 •

edited

Loading

pierotofy commented Apr 14, 2024

andrewkchan commented Apr 14, 2024

pfxuan commented Apr 14, 2024

pfxuan commented Apr 14, 2024

andrewkchan commented Apr 14, 2024 •

edited

Loading

pierotofy commented Apr 14, 2024 •

edited

Loading

andrewkchan commented Apr 14, 2024

pfxuan commented Apr 14, 2024

pfxuan commented Apr 14, 2024

pierotofy commented Apr 15, 2024 •

edited

Loading

pfxuan commented Apr 15, 2024

Add MPS support with fused kernels #76

Add MPS support with fused kernels #76

Conversation

andrewkchan commented Apr 14, 2024

pfxuan commented Apr 14, 2024

josephldobson commented Apr 14, 2024

pierotofy commented Apr 14, 2024 • edited Loading

pierotofy commented Apr 14, 2024

pfxuan commented Apr 14, 2024

pierotofy commented Apr 14, 2024

andrewkchan commented Apr 14, 2024 • edited Loading

pfxuan commented Apr 14, 2024 • edited Loading

pierotofy commented Apr 14, 2024

andrewkchan commented Apr 14, 2024

pfxuan commented Apr 14, 2024

pfxuan commented Apr 14, 2024

andrewkchan commented Apr 14, 2024 • edited Loading

pierotofy commented Apr 14, 2024 • edited Loading

andrewkchan commented Apr 14, 2024

pfxuan commented Apr 14, 2024

pfxuan commented Apr 14, 2024

pierotofy commented Apr 15, 2024 • edited Loading

pfxuan commented Apr 15, 2024

pierotofy commented Apr 14, 2024 •

edited

Loading

andrewkchan commented Apr 14, 2024 •

edited

Loading

pfxuan commented Apr 14, 2024 •

edited

Loading

andrewkchan commented Apr 14, 2024 •

edited

Loading

pierotofy commented Apr 14, 2024 •

edited

Loading

pierotofy commented Apr 15, 2024 •

edited

Loading